A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine
نویسندگان
چکیده
Several systems that rely on consistent data to offer high quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that, there have been significant investments from private and government organizations in developing methods for removing replicas from its data repositories.In this paper, we have proposed accordingly. In the previous work, duplicate record detection was done using three different similarity measures and neural network. In the previous work, we have generated feature vector based on similarity measures and then, neural network was used to find the duplicate records. In this paper, we have developed Q-gram concept with support vector machine for deduplication process. The similarity function, which we are used Dice coefficient,Damerau–Levenshtein distance,Tversky index for similarity measurement. Finally, support vector machine is used for testing whether data record is duplicate or not. A set of data generated from some similarity measures are used as the input to the proposed system. There are two processes which characterize the proposed deduplication technique, the training phase and the testing phase the experimental results showed that the proposed deduplication technique has higher accuracy than the existing method. The accuracy obtained for the proposed deduplication 88%.
منابع مشابه
A Wavelet Support Vector Machine Combination Model for Daily Suspended Sediment Forecasting
Abstract In this study, wavelet support vector machine (WSWM) model is proposed for daily suspended sediment (SS) prediction. The WSVM model is achieved by combination of two methods; discrete wavelet analysis and support vector machine (SVM). The developed model was compared with single SVM. Daily discharge (Q) and SS data from Yadkin River at Yadkin College, NC station in the USA were used. I...
متن کاملSupport vector regression with random output variable and probabilistic constraints
Support Vector Regression (SVR) solves regression problems based on the concept of Support Vector Machine (SVM). In this paper, a new model of SVR with probabilistic constraints is proposed that any of output data and bias are considered the random variables with uniform probability functions. Using the new proposed method, the optimal hyperplane regression can be obtained by solving a quadrati...
متن کاملFault diagnosis in a distillation column using a support vector machine based classifier
Fault diagnosis has always been an essential aspect of control system design. This is necessary due to the growing demand for increased performance and safety of industrial systems is discussed. Support vector machine classifier is a new technique based on statistical learning theory and is designed to reduce structural bias. Support vector machine classification in many applications in v...
متن کاملA New Formulation for Cost-Sensitive Two Group Support Vector Machine with Multiple Error Rate
Support vector machine (SVM) is a popular classification technique which classifies data using a max-margin separator hyperplane. The normal vector and bias of the mentioned hyperplane is determined by solving a quadratic model implies that SVM training confronts by an optimization problem. Among of the extensions of SVM, cost-sensitive scheme refers to a model with multiple costs which conside...
متن کاملOnline Voltage Stability Monitoring and Prediction by Using Support Vector Machine Considering Overcurrent Protection for Transmission Lines
In this paper, a novel method is proposed to monitor the power system voltage stability using Support Vector Machine (SVM) by implementing real-time data received from the Wide Area Measurement System (WAMS). In this study, the effects of the protection schemes on the voltage magnitude of the buses are considered while they have not been investigated in previous researches. Considering overcurr...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013